Abstract
Lowering the hemoglobin cut-off for polycythemia vera (PV) has markedly increased suspected PV referrals. PV is a myeloproliferative neoplasm (MPN) marked by elevated red blood cell mass, typically driven by the JAK2 V617F mutation; yet a substantial proportion of cases, particularly among Indian patients, are JAK2-negative (Bist A et al., Asian J Oncol 2022), and their molecular basis remains understudied. This diagnostic gap delays accurate detection and targeted treatment, underscoring the need to characterize further genomic contributors to PV. With this goal, we explored an interpretable deep-learning framework trained on whole-exome sequencing (WES) data represented using a novel tensorization method.
We analyzed WES data from an Indian cohort comprising 35 JAK2-negative PV cases, 16 JAK2-positive PV cases and 45 matched controls. Prior tensorization methods for representing variant call format (VCF) data rely on extensive literature-curated gene lists, which the limited research on JAK2-negative PV precludes. This dependence also biases the representation towards well-characterized genes, restricting the identification of novel disease-associated genes. To circumvent these limitations, we derived an unbiased gene list (n = 339) by selecting genes with rare variants present in over 75% of JAK2-negative samples and a pLI (probability of being loss-of-function intolerant) > 0.5. We further modified previous tensorization methods by incorporating features encoding clinical significance and other variant-level scores. Transforming each sample's mutational profile into a 2D tensor enabled the application of convolutional neural networks (CNNs). CNNs were trained exclusively on JAK2-negative cases to identify genomic drivers in this subtype. To mitigate overfitting and class imbalance, we employed dropout regularization and the Synthetic Minority Over-sampling Technique (SMOTE) on autoencoder-derived latent embeddings. We used Deep SHapley Additive exPlanations (DeepSHAP) to attribute classification decisions to gene-level features, making the model's predictions interpretable.
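The gene-list filter described above can be illustrated with a minimal sketch. It assumes the recurrence threshold is applied at the gene level (a gene counts for a sample if that sample carries at least one rare variant in it) and that rare-variant calls and pLI scores have already been extracted; the function name `select_genes`, the input layout, and the toy gene names are hypothetical and not taken from the study.

```python
from collections import defaultdict


def select_genes(rare_variant_calls, pli_scores, n_samples,
                 min_fraction=0.75, min_pli=0.5):
    """Return genes that pass both the recurrence and pLI filters.

    rare_variant_calls: iterable of (sample_id, gene) pairs, one per rare
        variant call in a JAK2-negative sample (hypothetical input layout).
    pli_scores: dict mapping gene symbol -> pLI score.
    n_samples: total number of JAK2-negative samples.
    """
    # Collect the distinct samples carrying a rare variant in each gene.
    samples_per_gene = defaultdict(set)
    for sample_id, gene in rare_variant_calls:
        samples_per_gene[gene].add(sample_id)

    selected = []
    for gene, samples in samples_per_gene.items():
        fraction = len(samples) / n_samples
        # Genes without a pLI score are skipped (an assumption of this sketch).
        if fraction > min_fraction and pli_scores.get(gene, 0.0) > min_pli:
            selected.append(gene)
    return sorted(selected)


# Toy data with made-up gene names; not from the study.
calls = [
    ("S1", "GENE_A"), ("S2", "GENE_A"), ("S3", "GENE_A"), ("S4", "GENE_A"),
    ("S1", "GENE_B"), ("S2", "GENE_B"), ("S3", "GENE_B"),
    ("S1", "GENE_C"), ("S2", "GENE_C"), ("S3", "GENE_C"), ("S4", "GENE_C"),
]
pli = {"GENE_A": 0.9, "GENE_B": 0.95, "GENE_C": 0.2}
print(select_genes(calls, pli, n_samples=4))  # ['GENE_A']
```

In the toy data, GENE_B is excluded because it occurs in exactly 75% of samples (the threshold is strict), and GENE_C is excluded because its pLI falls below 0.5.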
The model demonstrated robust performance on testing data (area under the receiver operating characteristic curve ~ 0.95, recall = 1.0 for JAK2-negative cases, accuracy ~ 94%). The model also generalized well to JAK2-positive cases, misclassifying only two samples. Interestingly, this was achieved despite the exclusion of JAK2-positive cases from the training set and the absence of JAK2 from the gene list. This suggests that the model captured broader PV-related genomic signatures beyond the canonical driver mutation.
Among the top 20% of SHapley Additive exPlanations (SHAP) scores, genes previously associated with PV (e.g., KCNMA1, MECOM) and at least 10 genes involved in the Essential Thrombocythemia–PV–Acute Myeloid Leukemia disease continuum (e.g., DGKB, LTBP1, NUMA1) were assigned high attribution scores, supporting the biological relevance of this pipeline. Importantly, we also found several genes not previously linked to PV. PTPRD, a recognized tumor suppressor involved in JAK-STAT regulation, had the highest SHAP attribution, suggesting a potential novel mechanistic role in PV pathogenesis. Other high-ranking genes (e.g., GRAMD1B, FHOD3, PTPRM) are associated with the JAK-STAT pathway. Genes implicated in other blood cancers (e.g., ANK2, HNRNPH1, BIRC6) and in other malignancies were also identified. These insights can inform biomarker studies and help identify pathways implicated in PV pathogenesis.
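As an illustration of the top-20% ranking step, the following sketch selects the highest-scoring genes from a table of gene-level attribution scores. It assumes the per-gene scores (for example, mean absolute SHAP values) have already been aggregated, which the abstract does not specify; the function name `top_fraction_genes` and the toy scores are hypothetical.

```python
import math


def top_fraction_genes(gene_scores, fraction=0.20):
    """Return the genes whose attribution scores fall in the top `fraction`.

    gene_scores: dict mapping gene symbol -> non-negative attribution score
        (assumed here to be a pre-aggregated, gene-level SHAP summary).
    """
    # Rank genes from highest to lowest attribution.
    ranked = sorted(gene_scores.items(), key=lambda item: item[1], reverse=True)
    # Keep at least one gene, rounding the cut-off up.
    k = max(1, math.ceil(fraction * len(ranked)))
    return [gene for gene, _ in ranked[:k]]


# Toy scores with made-up gene names; not from the study.
values = [0.01, 0.30, 0.05, 0.02, 0.12, 0.04, 0.03, 0.08, 0.06, 0.07]
scores = {f"GENE_{i}": s for i, s in enumerate(values, start=1)}
print(top_fraction_genes(scores))  # ['GENE_2', 'GENE_5']
```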
In conclusion, we report what is possibly the first interpretable CNN trained on tensorized WES data for malignancy classification. Although validation on independent cohorts is underway, the current findings show strong predictive performance, suggesting potential to improve early detection and diagnostic pipelines. The pipeline accurately predicts PV, recovers known genomic markers, and identifies several novel associations. Its recovery of known PV-associated genes lends support to these novel associations, which may contribute to a mechanistic understanding of both PV subtypes and inform biomarker discovery. More broadly, this literature-agnostic approach could be adapted to study other genetically understudied diseases with polygenic or unclear molecular etiologies.